Data Reduction via Adaptive Sampling
Abstract
Data reduction is an important issue in the field of data mining. This article describes a new method for selecting a subset of data from a large dataset. A simplified chi-square criterion is proposed for measuring the goodness of fit between the distributions of the reduced and full data sets. Under this criterion, the data reduction problem can be formulated as a binary quadratic program, and a tabu search technique is used in the search/optimization process. The procedure is adaptive in that it involves not only random sampling but also deterministic search guided by the results of previous search steps. The method applies primarily to discrete data but can be extended to continuous data as well. An experimental study comparing the proposed method with simple random sampling on a number of simulated and real-world datasets has been conducted. The results indicate that the distributions of the samples produced by the proposed method are significantly closer to the true distribution than those of random samples.

1. Introduction

In recent years, we have observed an explosion of electronic data generated and collected by individuals, corporations, and government agencies. It was estimated several years ago that the amount of data in the world was doubling every twenty months [5]. By current standards, that estimate is no doubt too conservative. The widespread use of bar codes and scanning devices for commercial products, the computerization of business and government transactions, the rapid development of electronic commerce over the Internet, and advances in storage technology and database management systems have allowed us to generate and store mountains of data. This rapid growth in data and databases has created the problem of data overload, and there has been an urgent need for new techniques and tools that can extract useful information and knowledge from massive volumes of data.
Consequently, an emerging field known as data mining has flourished in the past several years [4]. Data mining is the process of discovering hidden patterns in databases. The entire process loosely comprises three steps: (1) data preparation, which includes data collection, data cleaning, data reduction, and data transformation; (2) pattern exploration, which involves developing (or using existing) algorithms and computer programs to discover the patterns of interest; and (3) implementation, in which the patterns discovered in the previous step are used to solve real-world problems such as credit evaluation, fraud detection, and …
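As a rough illustration of the goodness-of-fit criterion described in the abstract, the sketch below computes a simplified chi-square statistic comparing a candidate sample's category proportions against those of the full dataset for a single discrete attribute. The function name, variable names, and example data are illustrative assumptions, not taken from the paper, and the paper's actual criterion and tabu search formulation may differ in detail.

```python
# Illustrative sketch (not the paper's implementation): a simplified
# chi-square statistic measuring how closely a sample's discrete
# distribution matches that of the full dataset.
from collections import Counter

def chi_square_distance(full_data, sample):
    """Chi-square-style distance between the sample's category counts
    and the counts expected under the full dataset's distribution."""
    full_counts = Counter(full_data)
    sample_counts = Counter(sample)
    n = len(sample)
    stat = 0.0
    for category, count in full_counts.items():
        # Expected count if the sample followed the full distribution
        expected = n * count / len(full_data)
        observed = sample_counts.get(category, 0)
        stat += (observed - expected) ** 2 / expected
    return stat

full = ["a"] * 50 + ["b"] * 30 + ["c"] * 20
good = ["a"] * 5 + ["b"] * 3 + ["c"] * 2   # mirrors the full proportions
bad = ["a"] * 10                           # misses two categories entirely
print(chi_square_distance(full, good))     # 0.0 — distributions match
print(chi_square_distance(full, bad))      # 10.0 — poor fit
```

A search procedure in the spirit of the paper would then treat each record's inclusion as a binary decision variable and seek a subset minimizing such a statistic, e.g. via tabu-guided local moves rather than purely random sampling.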
Similar Resources
A Data Reduction Algorithm for Magnetic Measurement Pre-processing at CERN
The principle of a data reduction algorithm, based on real-time adaptive sampling, specifically optimized for high-rate automatic measurement systems, is proposed. An adaptive sampling rule based on power estimation allows the optimum amount of information to be gathered in real time. The sampling rate is adapted when the limit conditions of insufficient/redundant information for the required s...
Column Selection via Adaptive Sampling
Selecting a good column (or row) subset of massive data matrices has found many applications in data analysis and machine learning. We propose a new adaptive sampling algorithm that can be used to improve any relative-error column selection algorithm. Our algorithm delivers a tighter theoretical bound on the approximation error which we also demonstrate empirically using two well known relative...
Data reduction in the ITMS system through a data acquisition model with self-adaptive sampling rate
Long-pulse or steady-state operation of fusion experiments requires data acquisition and processing systems that reduce the volume of data involved. The availability of self-adaptive sampling rate systems and the use of real-time lossless data compression techniques can help solve these problems. The former is important for continuous adaptation of sampling frequency for experimental requirement...
Intelligent Control of a Sensor-Actuator System via Kernelized Least-Squares Policy Iteration
In this paper a new framework, called Compressive Kernelized Reinforcement Learning (CKRL), for computing near-optimal policies in sequential decision making with uncertainty is proposed via incorporating the non-adaptive data-independent Random Projections and nonparametric Kernelized Least-squares Policy Iteration (KLSPI). Random Projections are a fast, non-adaptive dimensionality reduction f...
Adaptive modeling, adaptive data assimilation and adaptive sampling
For efficient progress, model properties and measurement needs can adapt to oceanic events and interactions as they occur. The combination of models and data via data assimilation can also be adaptive. These adaptive concepts are discussed and exemplified within the context of comprehensive real-time ocean observing and prediction systems. Novel adaptive modeling approaches based on simplified ...
A Comparative Study of Performance of Adaptive Web Sampling and General Inverse Adaptive Sampling in Estimating Olive Production in Iran
Nowadays, there is an increasing use of sampling methods in network and spatial populations. Although the most common link-tracing designs such as adaptive cluster sampling and snowball sampling have advantages over conventional sampling designs such as simple random sampling and cluster sampling, these designs still present many drawbacks. Adaptive web sampling is a new link-tracing design tha...
Publication year: 2002